Unsupervised Machine Learning Algorithms Final Project¶
Customer Segmentation using Clustering¶
Carmen Paz¶
1. Main Objective of the Analysis¶
2. Brief Description of the Dataset¶
3. Data Exploration and Cleaning¶
4. Training Unsupervised Model¶
5. Recommended Final Model¶
6. Key Findings and Insights¶
7. Suggestions for Next Steps¶
Data Source¶
Data source citation: Patel, A. (2021, August 22). Customer personality analysis. Kaggle. https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis/data marketing_campaign.csv
The features in the data are described as follows:
- ID: Customer's unique identifier
- Year_Birth: Customer's birth year
- Education: Customer's education level
- Marital_Status: Customer's marital status
- Income: Customer's yearly household income
- Kidhome: Number of children in customer's household
- Teenhome: Number of teenagers in customer's household
- Dt_Customer: Date of customer's enrollment with the company
- Recency: Number of days since customer's last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
- NumWebPurchases: Number of purchases made through the company’s website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to company’s website in the last month
- Z_CostContact: same for all rows
- Z_Revenue: same for all rows
Approach
Importing Libraries¶
import numpy as np
import pandas as pd
import datetime
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.axes as axes
from matplotlib import cm
import missingno.missingno as msno
import plotly.express as px
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from datetime import datetime
from warnings import filterwarnings
filterwarnings(action='ignore')
Exploratory Data Analysis¶
We will start by loading in the data and checking the data size.
# load data and check shape
df = pd.read_csv("marketing_campaign.csv", sep="\t")
df.shape
(2240, 29)
Data size: data has 2240 rows and 29 columns
df.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 rows × 29 columns
Data Cleaning¶
In this notebook, we will clean up the dataset and create a new dataframe with the data we're going to use.
We're going to:
- Find any missing values and decide what to do with rows with missing values, if any
- Simplify categorical data by reducing the number of unique values
- Create new columns based on values from old columns (e.g.
Agecolumn using theYear_Birthcolumn) - Choose certain columns to keep to reduce number of dimensions
Find missing values¶
When we check the info of the data we can see that 24 values in the Income column are missing, there are no other non-null values.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Z_CostContact 2240 non-null int64 27 Z_Revenue 2240 non-null int64 28 Response 2240 non-null int64 dtypes: float64(1), int64(25), object(3) memory usage: 507.6+ KB
We will drop the NA values since there are only 24 of them:
df = df.dropna()
df.shape
(2216, 29)
msno.matrix(df)
plt.show()
missing_rows = df.Income.isna().sum()
total_rows = df.shape[0]
print(f"""
There are {missing_rows} missing data points in the "Income" column.
Those rows make up {round((missing_rows/total_rows)*100, 2)}% of the total dataset.
""")
There are 0 missing data points in the "Income" column. Those rows make up 0.0% of the total dataset.
Customer age¶
df.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
5 rows × 29 columns
df['Age'] = datetime.now().year - df.Year_Birth
sns.histplot(x="Age", data=df, bins=50)
plt.show()
sns.boxplot(x="Age", data=df)
plt.show()
Clean up Education column¶
We will simplify the Education column into "Undergraduate" and "Postgraduate"
Let's check the unique values of feature "Education":
df["Education"].unique()
array(['Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'], dtype=object)
df.Education.value_counts()
Education Graduation 1116 PhD 481 Master 365 2n Cycle 200 Basic 54 Name: count, dtype: int64
def edu(x):
education = {
'Basic' :'Undergraduate',
'2n Cycle' :'Undergraduate',
'Graduation':'Postgraduate',
'Master' :'Postgraduate',
'PhD' :'Postgraduate'
}
return education[x]
df['Simplified_Education'] = df.Education.apply(edu)
df[['Education', 'Simplified_Education']].head()
| Education | Simplified_Education | |
|---|---|---|
| 0 | Graduation | Postgraduate |
| 1 | Graduation | Postgraduate |
| 2 | Graduation | Postgraduate |
| 3 | Graduation | Postgraduate |
| 4 | PhD | Postgraduate |
The unique values for Education are 'Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'. Based on information online, 2n Cycle corresponds to master's degree in some european countries but it is not clear if Graduation and Basic correspond to Undergraduate degree, so we will leave the values as they are.
Clean up Marital Status column¶
We will simplify the Marital_Status column values into either "Relationship" or "Alone"
For Marital_Status feature we can see that there are two invalid entries: Absurd and YOLO. We will drop these.
df = df.drop(df[(df["Marital_Status"] == "YOLO") | (df["Marital_Status"] == "Absurd")].index)
# double check the unique values
df["Marital_Status"].unique()
array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone'],
dtype=object)
df.Marital_Status.value_counts()
Marital_Status Married 857 Together 573 Single 471 Divorced 232 Widow 76 Alone 3 Name: count, dtype: int64
print(df[df['Marital_Status'].str.contains('YOLO', case=False, na=False)])
Empty DataFrame Columns: [ID, Year_Birth, Education, Marital_Status, Income, Kidhome, Teenhome, Dt_Customer, Recency, MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, AcceptedCmp1, AcceptedCmp2, Complain, Z_CostContact, Z_Revenue, Response, Age, Simplified_Education] Index: [] [0 rows x 31 columns]
df[df['Marital_Status'] == 'YOLO']
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | Age | Simplified_Education |
|---|
0 rows × 31 columns
df[df['Marital_Status'] == 'Absurd']
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | Age | Simplified_Education |
|---|
0 rows × 31 columns
Dropping the "YOLO" and "Absurd" marital status because we are not sure if that means "Single" or "In a relationship". We do not want to make assumptions that may be incorrect.
df = df[(df.Marital_Status != 'YOLO') & (df.Marital_Status != 'Absurd')]
df.Marital_Status.value_counts()
Marital_Status Married 857 Together 573 Single 471 Divorced 232 Widow 76 Alone 3 Name: count, dtype: int64
def married_single(x):
relationship = {
'Married' :'Relationship',
'Together':'Relationship',
'Single' :'Alone',
'Divorced':'Alone',
'Widow' :'Alone',
'Alone' :'Alone'
}
return relationship[x]
df['Relationship_Status'] = df.Marital_Status.apply(married_single)
df[['Marital_Status', 'Relationship_Status']].head(10)
| Marital_Status | Relationship_Status | |
|---|---|---|
| 0 | Single | Alone |
| 1 | Single | Alone |
| 2 | Together | Relationship |
| 3 | Together | Relationship |
| 4 | Married | Relationship |
| 5 | Together | Relationship |
| 6 | Divorced | Alone |
| 7 | Married | Relationship |
| 8 | Together | Relationship |
| 9 | Together | Relationship |
Clean up the children column¶
We will calculate how many children are in each household and create a new column containing boolean values on whether or not the household has a child.
# Total number of children
df['Num_children'] = df.Kidhome + df.Teenhome
# Has a child (True/False)
df['Has_child'] = df['Num_children'].apply(lambda x: x != 0)
df[['Num_children', 'Has_child']].head()
| Num_children | Has_child | |
|---|---|---|
| 0 | 0 | False |
| 1 | 2 | True |
| 2 | 0 | False |
| 3 | 1 | True |
| 4 | 1 | True |
Rename the products columns¶
We will rename the products columns to something simpler and easier to understand for our analysis. Then we will add the total amount of products purchased.
df = df.rename(columns={
'MntWines' :'Wines',
'MntFruits' :'Fruits',
'MntMeatProducts' :'Meat',
'MntFishProducts' :'Fish',
'MntSweetProducts':'Sweets',
'MntGoldProds' :'Gold'
})
df.iloc[:,9:15].head()
| Wines | Fruits | Meat | Fish | Sweets | Gold | |
|---|---|---|---|---|---|---|
| 0 | 635 | 88 | 546 | 172 | 88 | 88 |
| 1 | 11 | 1 | 6 | 2 | 1 | 6 |
| 2 | 426 | 49 | 127 | 111 | 21 | 42 |
| 3 | 11 | 4 | 20 | 10 | 3 | 5 |
| 4 | 173 | 43 | 118 | 46 | 27 | 15 |
# Sum of products
df['Total'] = df.iloc[:,9:15].sum(axis=1)
df[['Wines', 'Fruits', 'Meat', 'Fish', 'Sweets', 'Gold', 'Total']].head()
| Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|
| 0 | 635 | 88 | 546 | 172 | 88 | 88 | 1617 |
| 1 | 11 | 1 | 6 | 2 | 1 | 6 | 27 |
| 2 | 426 | 49 | 127 | 111 | 21 | 42 | 776 |
| 3 | 11 | 4 | 20 | 10 | 3 | 5 | 53 |
| 4 | 173 | 43 | 118 | 46 | 27 | 15 | 422 |
Income column¶
df.Income.describe()
count 2212.000000 mean 52232.510850 std 25187.455359 min 1730.000000 25% 35233.500000 50% 51381.500000 75% 68522.000000 max 666666.000000 Name: Income, dtype: float64
sns.histplot(x="Income", data=df, color='green')
plt.show()
sns.boxplot(x="Income", data=df, color='green')
plt.show()
Let's check out the high income earners to see if there is any useful information
df[df['Income'] >= 150000]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | Wines | ... | Complain | Z_CostContact | Z_Revenue | Response | Age | Simplified_Education | Relationship_Status | Num_children | Has_child | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 8475 | 1973 | PhD | Married | 157243.0 | 0 | 1 | 01-03-2014 | 98 | 20 | ... | 0 | 3 | 11 | 0 | 52 | Postgraduate | Relationship | 1 | True | 1608 |
| 617 | 1503 | 1976 | PhD | Together | 162397.0 | 1 | 1 | 03-06-2013 | 31 | 85 | ... | 0 | 3 | 11 | 0 | 49 | Postgraduate | Relationship | 2 | True | 107 |
| 655 | 5555 | 1975 | Graduation | Divorced | 153924.0 | 0 | 0 | 07-02-2014 | 81 | 1 | ... | 0 | 3 | 11 | 0 | 50 | Postgraduate | Alone | 0 | False | 6 |
| 687 | 1501 | 1982 | PhD | Married | 160803.0 | 0 | 0 | 04-08-2012 | 21 | 55 | ... | 0 | 3 | 11 | 0 | 43 | Postgraduate | Relationship | 0 | False | 1717 |
| 1300 | 5336 | 1971 | Master | Together | 157733.0 | 1 | 0 | 04-06-2013 | 37 | 39 | ... | 0 | 3 | 11 | 0 | 54 | Postgraduate | Relationship | 1 | True | 59 |
| 1653 | 4931 | 1977 | Graduation | Together | 157146.0 | 0 | 0 | 29-04-2013 | 13 | 1 | ... | 0 | 3 | 11 | 0 | 48 | Postgraduate | Relationship | 0 | False | 1730 |
| 2132 | 11181 | 1949 | PhD | Married | 156924.0 | 0 | 0 | 29-08-2013 | 85 | 2 | ... | 0 | 3 | 11 | 0 | 76 | Postgraduate | Relationship | 0 | False | 8 |
| 2233 | 9432 | 1977 | Graduation | Together | 666666.0 | 1 | 0 | 02-06-2013 | 23 | 9 | ... | 0 | 3 | 11 | 0 | 48 | Postgraduate | Relationship | 1 | True | 62 |
8 rows × 35 columns
We will remove these because there are too few data points to get any meaningful information. These outliers could affect the model.
df = df[df['Income'] <= 150000]
df.Income.describe()
count 2204.000000 mean 51617.730490 std 20721.492888 min 1730.000000 25% 35191.500000 50% 51301.000000 75% 68289.750000 max 113734.000000 Name: Income, dtype: float64
Customer seniority¶
# Convert string to datetime object
df['Dt_Customer'] = df['Dt_Customer'].apply(lambda x: datetime.strptime(x, '%d-%m-%Y').date())
# Calculate the number of days since customers joined
df['days_joined'] = datetime.now().date() - df.Dt_Customer
df['days_joined'] = df.days_joined.apply(lambda x: x.days)
# Convert number of days to approximate years
# Approximate because it does not take into account leap years
df['Seniority'] = round(df.days_joined / 365, 2)
df[['Dt_Customer', 'days_joined', 'Seniority']].head()
| Dt_Customer | days_joined | Seniority | |
|---|---|---|---|
| 0 | 2012-09-04 | 4571 | 12.52 |
| 1 | 2014-03-08 | 4021 | 11.02 |
| 2 | 2013-08-21 | 4220 | 11.56 |
| 3 | 2014-02-10 | 4047 | 11.09 |
| 4 | 2014-01-19 | 4069 | 11.15 |
df.Seniority.describe()
count 2204.000000 mean 11.674678 std 0.554673 min 10.710000 25% 11.200000 50% 11.680000 75% 12.160000 max 12.620000 Name: Seniority, dtype: float64
sns.histplot(x="Seniority", data=df)
plt.show()
Convert data types¶
Income column¶
Convert from float to integer
df.Income.dtype
dtype('float64')
# Check if any data in the income column does not end with a ".0"
print(df.Income.shape[0] - df.Income.astype(str).str.endswith('.0').sum())
0
Since all values in the Income column end with .0, we will convert them to integers
df['Income'] = df.Income.astype(int)
Create new DataFrame with selected columns¶
Let's check the values of our data set:
df.describe()
| ID | Year_Birth | Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meat | Fish | ... | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response | Age | Num_children | Total | days_joined | Seniority | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | ... | 2204.000000 | 2204.000000 | 2204.0 | 2204.0 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 |
| mean | 5586.274501 | 1968.797641 | 51617.730490 | 0.442831 | 0.506352 | 49.052178 | 306.041742 | 26.403811 | 165.318966 | 37.637024 | ... | 0.013612 | 0.009528 | 3.0 | 11.0 | 0.150181 | 56.202359 | 0.949183 | 606.510436 | 4261.257260 | 11.674678 |
| std | 3247.672073 | 11.988415 | 20721.492888 | 0.537220 | 0.544402 | 28.931031 | 337.800997 | 39.830477 | 217.904324 | 54.619808 | ... | 0.115898 | 0.097168 | 0.0 | 0.0 | 0.357330 | 11.988415 | 0.749128 | 602.352523 | 202.488971 | 0.554673 |
| min | 0.000000 | 1893.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 | 29.000000 | 0.000000 | 5.000000 | 3908.000000 | 10.710000 |
| 25% | 2814.750000 | 1959.000000 | 35191.500000 | 0.000000 | 0.000000 | 24.000000 | 24.000000 | 2.000000 | 16.000000 | 3.000000 | ... | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 | 48.000000 | 0.000000 | 69.000000 | 4088.000000 | 11.200000 |
| 50% | 5458.500000 | 1970.000000 | 51301.000000 | 0.000000 | 0.000000 | 49.000000 | 176.000000 | 8.000000 | 68.000000 | 12.000000 | ... | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 | 55.000000 | 1.000000 | 396.000000 | 4263.000000 | 11.680000 |
| 75% | 8418.500000 | 1977.000000 | 68289.750000 | 1.000000 | 1.000000 | 74.000000 | 507.250000 | 33.000000 | 232.000000 | 50.000000 | ... | 0.000000 | 0.000000 | 3.0 | 11.0 | 0.000000 | 66.000000 | 1.000000 | 1045.500000 | 4437.000000 | 12.160000 |
| max | 11191.000000 | 1996.000000 | 113734.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 1.000000 | 1.000000 | 3.0 | 11.0 | 1.000000 | 132.000000 | 3.000000 | 2525.000000 | 4607.000000 | 12.620000 |
8 rows × 31 columns
df.columns
Index(['ID', 'Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
'Teenhome', 'Dt_Customer', 'Recency', 'Wines', 'Fruits', 'Meat', 'Fish',
'Sweets', 'Gold', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response',
'Age', 'Simplified_Education', 'Relationship_Status', 'Num_children',
'Has_child', 'Total', 'days_joined', 'Seniority'],
dtype='object')
temp_df = df[['Age', 'Simplified_Education', 'Relationship_Status', 'Income',
'Seniority', 'Num_children', 'Has_child', 'Wines', 'Fruits',
'Meat', 'Fish', 'Sweets', 'Gold', 'Total']]
temp_df = temp_df.rename(columns={
'Simplified_Education':'Education',
})
temp_df.head()
| Age | Education | Relationship_Status | Income | Seniority | Num_children | Has_child | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | Postgraduate | Alone | 58138 | 12.52 | 0 | False | 635 | 88 | 546 | 172 | 88 | 88 | 1617 |
| 1 | 71 | Postgraduate | Alone | 46344 | 11.02 | 2 | True | 11 | 1 | 6 | 2 | 1 | 6 | 27 |
| 2 | 60 | Postgraduate | Relationship | 71613 | 11.56 | 0 | False | 426 | 49 | 127 | 111 | 21 | 42 | 776 |
| 3 | 41 | Postgraduate | Relationship | 26646 | 11.09 | 1 | True | 11 | 4 | 20 | 10 | 3 | 5 | 53 |
| 4 | 44 | Postgraduate | Relationship | 58293 | 11.15 | 1 | True | 173 | 43 | 118 | 46 | 27 | 15 | 422 |
temp_df.Education.value_counts(normalize=True)
Education Postgraduate 0.884755 Undergraduate 0.115245 Name: proportion, dtype: float64
temp_df.Relationship_Status.value_counts(normalize=True)
Relationship_Status Relationship 0.645644 Alone 0.354356 Name: proportion, dtype: float64
temp_df.Num_children.value_counts(normalize=True)
Num_children 1 0.504537 0 0.284483 2 0.188294 3 0.022686 Name: proportion, dtype: float64
temp_df.describe()
| Age | Income | Seniority | Num_children | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 | 2204.000000 |
| mean | 56.202359 | 51617.730490 | 11.674678 | 0.949183 | 306.041742 | 26.403811 | 165.318966 | 37.637024 | 27.141107 | 43.967786 | 606.510436 |
| std | 11.988415 | 20721.492888 | 0.554673 | 0.749128 | 337.800997 | 39.830477 | 217.904324 | 54.619808 | 41.137498 | 51.659983 | 602.352523 |
| min | 29.000000 | 1730.000000 | 10.710000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 |
| 25% | 48.000000 | 35191.500000 | 11.200000 | 0.000000 | 24.000000 | 2.000000 | 16.000000 | 3.000000 | 1.000000 | 9.000000 | 69.000000 |
| 50% | 55.000000 | 51301.000000 | 11.680000 | 1.000000 | 176.000000 | 8.000000 | 68.000000 | 12.000000 | 8.000000 | 25.000000 | 396.000000 |
| 75% | 66.000000 | 68289.750000 | 12.160000 | 1.000000 | 507.250000 | 33.000000 | 232.000000 | 50.000000 | 34.000000 | 56.000000 | 1045.500000 |
| max | 132.000000 | 113734.000000 | 12.620000 | 3.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | 262.000000 | 321.000000 | 2525.000000 |
sns.histplot(x="Total", data=temp_df, bins=50)
plt.show()
sns.histplot(x="Age", data=temp_df, bins=50)
plt.show()
We can see as the minimum value for Year_Birth 1893. That must be a mistake, so we will filter out birth years less than 1930.
df = df[df['Year_Birth'] >= 1930]
We can also see an outlier for Income, maximum value of 666666, we will filter out Income higher than 600000
df = df[df['Income'] < 600000]
temp_df.head()
| Age | Education | Relationship_Status | Income | Seniority | Num_children | Has_child | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68 | Postgraduate | Alone | 58138 | 12.52 | 0 | False | 635 | 88 | 546 | 172 | 88 | 88 | 1617 |
| 1 | 71 | Postgraduate | Alone | 46344 | 11.02 | 2 | True | 11 | 1 | 6 | 2 | 1 | 6 | 27 |
| 2 | 60 | Postgraduate | Relationship | 71613 | 11.56 | 0 | False | 426 | 49 | 127 | 111 | 21 | 42 | 776 |
| 3 | 41 | Postgraduate | Relationship | 26646 | 11.09 | 1 | True | 11 | 4 | 20 | 10 | 3 | 5 | 53 |
| 4 | 44 | Postgraduate | Relationship | 58293 | 11.15 | 1 | True | 173 | 43 | 118 | 46 | 27 | 15 | 422 |
Feature Engineering¶
The unique values for Marital_Status are Single, Together, Married, Divorced, Widow, Alone. However we only care if the customer has a partner at home or not, so we will create a new feature Partner_Status and code Married, Partner and Together as 1 and Widow, Divorced, Single and Alone as 0.
df["Partner_Status"]=df["Marital_Status"].replace({"Married": 1, "Partner": 1, "Together": 1,
"Widow": 0, "Divorced": 0, "Single": 0, "Alone": 0})
df.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | Wines | ... | Response | Age | Simplified_Education | Relationship_Status | Num_children | Has_child | Total | days_joined | Seniority | Partner_Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138 | 0 | 0 | 2012-09-04 | 58 | 635 | ... | 1 | 68 | Postgraduate | Alone | 0 | False | 1617 | 4571 | 12.52 | 0 |
| 1 | 2174 | 1954 | Graduation | Single | 46344 | 1 | 1 | 2014-03-08 | 38 | 11 | ... | 0 | 71 | Postgraduate | Alone | 2 | True | 27 | 4021 | 11.02 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613 | 0 | 0 | 2013-08-21 | 26 | 426 | ... | 0 | 60 | Postgraduate | Relationship | 0 | False | 776 | 4220 | 11.56 | 1 |
| 3 | 6182 | 1984 | Graduation | Together | 26646 | 1 | 0 | 2014-02-10 | 26 | 11 | ... | 0 | 41 | Postgraduate | Relationship | 1 | True | 53 | 4047 | 11.09 | 1 |
| 4 | 5324 | 1981 | PhD | Married | 58293 | 1 | 0 | 2014-01-19 | 94 | 173 | ... | 0 | 44 | Postgraduate | Relationship | 1 | True | 422 | 4069 | 11.15 | 1 |
5 rows × 38 columns
df.describe()
| ID | Year_Birth | Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meat | Fish | ... | Complain | Z_CostContact | Z_Revenue | Response | Age | Num_children | Total | days_joined | Seniority | Partner_Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | ... | 2201.000000 | 2201.0 | 2201.0 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 |
| mean | 5584.809632 | 1968.895048 | 51606.144480 | 0.442980 | 0.506588 | 49.047251 | 306.105407 | 26.371649 | 165.283053 | 37.634711 | ... | 0.009087 | 3.0 | 11.0 | 0.150386 | 56.104952 | 0.949568 | 606.455702 | 4261.468423 | 11.675261 | 0.646070 |
| std | 3246.102409 | 11.701763 | 20721.181896 | 0.537289 | 0.544457 | 28.924487 | 337.778943 | 39.772318 | 217.836140 | 54.630912 | ... | 0.094912 | 0.0 | 0.0 | 0.357531 | 11.701763 | 0.749364 | 601.937286 | 202.505481 | 0.554713 | 0.478296 |
| min | 0.000000 | 1940.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 3.0 | 11.0 | 0.000000 | 29.000000 | 0.000000 | 5.000000 | 3908.000000 | 10.710000 | 0.000000 |
| 25% | 2815.000000 | 1959.000000 | 35178.000000 | 0.000000 | 0.000000 | 24.000000 | 24.000000 | 2.000000 | 16.000000 | 3.000000 | ... | 0.000000 | 3.0 | 11.0 | 0.000000 | 48.000000 | 0.000000 | 69.000000 | 4088.000000 | 11.200000 | 0.000000 |
| 50% | 5455.000000 | 1970.000000 | 51287.000000 | 0.000000 | 0.000000 | 49.000000 | 176.000000 | 8.000000 | 68.000000 | 12.000000 | ... | 0.000000 | 3.0 | 11.0 | 0.000000 | 55.000000 | 1.000000 | 396.000000 | 4263.000000 | 11.680000 | 1.000000 |
| 75% | 8418.000000 | 1977.000000 | 68281.000000 | 1.000000 | 1.000000 | 74.000000 | 507.000000 | 33.000000 | 232.000000 | 50.000000 | ... | 0.000000 | 3.0 | 11.0 | 0.000000 | 66.000000 | 1.000000 | 1045.000000 | 4437.000000 | 12.160000 | 1.000000 |
| max | 11191.000000 | 1996.000000 | 113734.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | ... | 1.000000 | 3.0 | 11.0 | 1.000000 | 85.000000 | 3.000000 | 2525.000000 | 4607.000000 | 12.620000 | 1.000000 |
8 rows × 32 columns
We will create feature "Age" by substracting Year_Birth from 2024.
df['Age'] = 2024 - df['Year_Birth']
We will drop the features we are not going to use:
cols_drop = ["Marital_Status", "Dt_Customer", "Z_CostContact", "Z_Revenue", "ID", "Year_Birth"]
df = df.drop(cols_drop, axis=1)
df.head()
| Education | Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meat | Fish | Sweets | ... | Response | Age | Simplified_Education | Relationship_Status | Num_children | Has_child | Total | days_joined | Seniority | Partner_Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduation | 58138 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | ... | 1 | 67 | Postgraduate | Alone | 0 | False | 1617 | 4571 | 12.52 | 0 |
| 1 | Graduation | 46344 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | ... | 0 | 70 | Postgraduate | Alone | 2 | True | 27 | 4021 | 11.02 | 0 |
| 2 | Graduation | 71613 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | ... | 0 | 59 | Postgraduate | Relationship | 0 | False | 776 | 4220 | 11.56 | 1 |
| 3 | Graduation | 26646 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | ... | 0 | 40 | Postgraduate | Relationship | 1 | True | 53 | 4047 | 11.09 | 1 |
| 4 | PhD | 58293 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | ... | 0 | 43 | Postgraduate | Relationship | 1 | True | 422 | 4069 | 11.15 | 1 |
5 rows × 32 columns
Data Visualizations¶
Let's chack the distributions for age, income and total spent.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes = axes.flatten()
sns.histplot(data=df, x="Age", bins=20, ax=axes[0])
sns.histplot(data=df, x="Income", bins=20, ax=axes[1])
sns.histplot(data=df, x="Total", bins=20, ax=axes[2])
#sns.histplot(data=df, x="Total_Spent", bins=20, ax=axes[2])
plt.show()
We can see that Age is roughly normally distributed, there are still some outliers in the high income group and there is a high number of customers with very low total spending.
Next we will check demographic information of the customers, their education, partner status, how many kids they have at home and how many teenagers they have at home.
fig, axes = plt.subplots(2, 2, figsize=(15, 9))
axes = axes.flatten()
sns.countplot(data=df, x="Education", ax=axes[0])
sns.countplot(data=df, x="Partner_Status", ax=axes[1])
sns.countplot(data=df, x="Kidhome", ax=axes[2])
sns.countplot(data=df, x="Teenhome", ax=axes[3])
<Axes: xlabel='Teenhome', ylabel='count'>
Graduation is the highest education category, most customers have a partner. Many customers do not have a kid or teenager at home, many have one kid or one teenager and very few have two.
Correlation between features¶
Now we are going to check the correlation between features with a correlation matrix heat map. We can see that there is correlation between many features.
cor = df[df.columns[df.dtypes != 'object']].corr()
plt.figure(figsize=(20,20))
sns.heatmap(cor, annot=True)
plt.savefig('corheat.png', dpi=300)
Let's check the relationship between income and total spent. We can see that there is a strong relationship.
# 'Total'='Total_Spent'
plt.figure(figsize=(8, 4))
sns.regplot(x='Income', y='Total', data=df, line_kws={'color': 'red'})
plt.title('Income vs Total')
axes = plt.gca()
axes.set_ylim(0, 3500)
plt.xlabel('Income')
plt.ylabel('Total')
#plt.ylabel('Total_Spent')
plt.show()
Exploratory Data Analysis¶
The following is a list of what we will be looking for:
- View the frequency counts of unique values of categorical data
- View the distributions of numerical data
Missing Values¶
df.head()
| Education | Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meat | Fish | Sweets | ... | Response | Age | Simplified_Education | Relationship_Status | Num_children | Has_child | Total | days_joined | Seniority | Partner_Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduation | 58138 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | ... | 1 | 67 | Postgraduate | Alone | 0 | False | 1617 | 4571 | 12.52 | 0 |
| 1 | Graduation | 46344 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | ... | 0 | 70 | Postgraduate | Alone | 2 | True | 27 | 4021 | 11.02 | 0 |
| 2 | Graduation | 71613 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | ... | 0 | 59 | Postgraduate | Relationship | 0 | False | 776 | 4220 | 11.56 | 1 |
| 3 | Graduation | 26646 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | ... | 0 | 40 | Postgraduate | Relationship | 1 | True | 53 | 4047 | 11.09 | 1 |
| 4 | PhD | 58293 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | ... | 0 | 43 | Postgraduate | Relationship | 1 | True | 422 | 4069 | 11.15 | 1 |
5 rows × 32 columns
msno.matrix(df)
plt.show()
Categorical Data¶
cat_df = df[['Education', 'Relationship_Status', 'Has_child']]
cat_df.head()
| Education | Relationship_Status | Has_child | |
|---|---|---|---|
| 0 | Graduation | Alone | False |
| 1 | Graduation | Alone | True |
| 2 | Graduation | Relationship | False |
| 3 | Graduation | Relationship | True |
| 4 | PhD | Relationship | True |
Education¶
education_count = cat_df['Education'].value_counts(normalize=True)
pd.DataFrame(round(education_count*100, 2))
| proportion | |
|---|---|
| Education | |
| Graduation | 50.52 |
| PhD | 21.54 |
| Master | 16.49 |
| 2n Cycle | 9.00 |
| Basic | 2.45 |
px.bar(education_count)
Relationship Status¶
relationship_count = cat_df['Relationship_Status'].value_counts(normalize=True)
pd.DataFrame(round(relationship_count*100, 2))
| proportion | |
|---|---|
| Relationship_Status | |
| Relationship | 64.61 |
| Alone | 35.39 |
px.bar(relationship_count)
Has Child¶
haschild_count = cat_df['Has_child'].value_counts(normalize=True)
pd.DataFrame(round(haschild_count*100, 2))
| proportion | |
|---|---|
| Has_child | |
| True | 71.56 |
| False | 28.44 |
px.bar(haschild_count)
Numerical Data¶
num_df = df[['Age', 'Income', 'Seniority', 'Num_children', 'Wines',
'Fruits', 'Meat', 'Fish', 'Sweets', 'Gold', 'Total']]
num_df.head()
| Age | Income | Seniority | Num_children | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | 58138 | 12.52 | 0 | 635 | 88 | 546 | 172 | 88 | 88 | 1617 |
| 1 | 70 | 46344 | 11.02 | 2 | 11 | 1 | 6 | 2 | 1 | 6 | 27 |
| 2 | 59 | 71613 | 11.56 | 0 | 426 | 49 | 127 | 111 | 21 | 42 | 776 |
| 3 | 40 | 26646 | 11.09 | 1 | 11 | 4 | 20 | 10 | 3 | 5 | 53 |
| 4 | 43 | 58293 | 11.15 | 1 | 173 | 43 | 118 | 46 | 27 | 15 | 422 |
num_df.describe()
| Age | Income | Seniority | Num_children | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 | 2201.000000 |
| mean | 55.104952 | 51606.144480 | 11.675261 | 0.949568 | 306.105407 | 26.371649 | 165.283053 | 37.634711 | 27.147206 | 43.913676 | 606.455702 |
| std | 11.701763 | 20721.181896 | 0.554713 | 0.749364 | 337.778943 | 39.772318 | 217.836140 | 54.630912 | 41.151010 | 51.543122 | 601.937286 |
| min | 28.000000 | 1730.000000 | 10.710000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000 |
| 25% | 47.000000 | 35178.000000 | 11.200000 | 0.000000 | 24.000000 | 2.000000 | 16.000000 | 3.000000 | 1.000000 | 9.000000 | 69.000000 |
| 50% | 54.000000 | 51287.000000 | 11.680000 | 1.000000 | 176.000000 | 8.000000 | 68.000000 | 12.000000 | 8.000000 | 25.000000 | 396.000000 |
| 75% | 65.000000 | 68281.000000 | 12.160000 | 1.000000 | 507.000000 | 33.000000 | 232.000000 | 50.000000 | 34.000000 | 56.000000 | 1045.000000 |
| max | 84.000000 | 113734.000000 | 12.620000 | 3.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | 262.000000 | 321.000000 | 2525.000000 |
Age¶
sns.histplot(x="Age", data=num_df)
plt.show()
sns.histplot(x="Total", data=num_df, bins=30)
plt.show()
Items¶
item_sum = num_df.iloc[:,4:10].sum(axis=0)
pd.DataFrame(item_sum, columns=['# Sold'])
| # Sold | |
|---|---|
| Wines | 673738 |
| Fruits | 58044 |
| Meat | 363788 |
| Fish | 82834 |
| Sweets | 59751 |
| Gold | 96654 |
sns.histplot(x="Wines", data=num_df)
plt.show()
item_df = pd.DataFrame({
'Item': item_sum.index,
'Amount Sold': item_sum.values
})
item_df['Percentage'] = item_df['Amount Sold'].apply(lambda x: round(((x/num_df['Total'].sum())*100), 2))
# Sort by percentage (descending)
item_df.sort_values(['Percentage'], ascending=False, inplace=True)
item_df
| Item | Amount Sold | Percentage | |
|---|---|---|---|
| 0 | Wines | 673738 | 50.47 |
| 2 | Meat | 363788 | 27.25 |
| 5 | Gold | 96654 | 7.24 |
| 3 | Fish | 82834 | 6.21 |
| 4 | Sweets | 59751 | 4.48 |
| 1 | Fruits | 58044 | 4.35 |
px.bar(item_df, x='Item', y='Amount Sold')
Modeling¶
In this notebook, we are going to use the K-Means algorithm to segment the customers into different clusters.
We will:
- Prepare the dataset for analysis by one hot encoding or label encoding the categorical data
- Choose the number of clusters by looking at the inertia and silhouette score
- Visualize the data from each cluster
df.head()
| Education | Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meat | Fish | Sweets | ... | Response | Age | Simplified_Education | Relationship_Status | Num_children | Has_child | Total | days_joined | Seniority | Partner_Status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduation | 58138 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | ... | 1 | 67 | Postgraduate | Alone | 0 | False | 1617 | 4571 | 12.52 | 0 |
| 1 | Graduation | 46344 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | ... | 0 | 70 | Postgraduate | Alone | 2 | True | 27 | 4021 | 11.02 | 0 |
| 2 | Graduation | 71613 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | ... | 0 | 59 | Postgraduate | Relationship | 0 | False | 776 | 4220 | 11.56 | 1 |
| 3 | Graduation | 26646 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | ... | 0 | 40 | Postgraduate | Relationship | 1 | True | 53 | 4047 | 11.09 | 1 |
| 4 | PhD | 58293 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | ... | 0 | 43 | Postgraduate | Relationship | 1 | True | 422 | 4069 | 11.15 | 1 |
5 rows × 32 columns
df.dtypes
Education object Income int64 Kidhome int64 Teenhome int64 Recency int64 Wines int64 Fruits int64 Meat int64 Fish int64 Sweets int64 Gold int64 NumDealsPurchases int64 NumWebPurchases int64 NumCatalogPurchases int64 NumStorePurchases int64 NumWebVisitsMonth int64 AcceptedCmp3 int64 AcceptedCmp4 int64 AcceptedCmp5 int64 AcceptedCmp1 int64 AcceptedCmp2 int64 Complain int64 Response int64 Age int64 Simplified_Education object Relationship_Status object Num_children int64 Has_child bool Total int64 days_joined int64 Seniority float64 Partner_Status int64 dtype: object
Convert Categorical Data to Numerical¶
# One Hot Encode Education column
df['Edu'] = df['Education'].apply(lambda x: 1 if x == "Postgraduate" else 0)
df['Child'] = df['Has_child'].apply(lambda x: 0 if x == False else 1)
# Label Encode the Relationship Status column
df['Single'] = df['Relationship_Status'].apply(lambda x: 1 if x == 'Alone' else 0)
df['Relationship'] = df['Relationship_Status'].apply(lambda x: 1 if x == 'Relationship' else 0)
Save new DataFrame with numerical columns
num_df = df[['Age', 'Edu', 'Single', 'Relationship', 'Income',
'Seniority', 'Num_children', 'Child', 'Wines',
'Fruits', 'Meat', 'Fish', 'Sweets', 'Gold', 'Total']]
num_df.head()
| Age | Edu | Single | Relationship | Income | Seniority | Num_children | Child | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | 0 | 1 | 0 | 58138 | 12.52 | 0 | 0 | 635 | 88 | 546 | 172 | 88 | 88 | 1617 |
| 1 | 70 | 0 | 1 | 0 | 46344 | 11.02 | 2 | 1 | 11 | 1 | 6 | 2 | 1 | 6 | 27 |
| 2 | 59 | 0 | 0 | 1 | 71613 | 11.56 | 0 | 0 | 426 | 49 | 127 | 111 | 21 | 42 | 776 |
| 3 | 40 | 0 | 0 | 1 | 26646 | 11.09 | 1 | 1 | 11 | 4 | 20 | 10 | 3 | 5 | 53 |
| 4 | 43 | 0 | 0 | 1 | 58293 | 11.15 | 1 | 1 | 173 | 43 | 118 | 46 | 27 | 15 | 422 |
# Reset index
num_df.reset_index(drop=True, inplace=True)
num_df.dtypes
Age int64 Edu int64 Single int64 Relationship int64 Income int64 Seniority float64 Num_children int64 Child int64 Wines int64 Fruits int64 Meat int64 Fish int64 Sweets int64 Gold int64 Total int64 dtype: object
correlation_df = num_df.corr()
#correlation_df
We want to group customers by how much they spend, so we will choose columns with high correlation values with Total column
# Strong positive correlation with 'Total'
pos_corr = correlation_df['Total'][(correlation_df['Total'] >= 0.5)]
pos_corr
Income 0.823645 Wines 0.897616 Fruits 0.614031 Meat 0.857106 Fish 0.644740 Sweets 0.609521 Gold 0.529779 Total 1.000000 Name: Total, dtype: float64
# Strong negative correlation with 'Total'
neg_corr = correlation_df['Total'][(correlation_df['Total'] <= -0.5)]
neg_corr
Num_children -0.500347 Child -0.522597 Name: Total, dtype: float64
NOTE¶
This is where I remove the Child column because it was heavily impacting my model during my initial attempt. Num_children is enough.
all_df = num_df[neg_corr.index.tolist() + pos_corr.index.tolist()].copy()
X = all_df.drop('Child', axis=1).copy()
X.head()
| Num_children | Income | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58138 | 635 | 88 | 546 | 172 | 88 | 88 | 1617 |
| 1 | 2 | 46344 | 11 | 1 | 6 | 2 | 1 | 6 | 27 |
| 2 | 0 | 71613 | 426 | 49 | 127 | 111 | 21 | 42 | 776 |
| 3 | 1 | 26646 | 11 | 4 | 20 | 10 | 3 | 5 | 53 |
| 4 | 1 | 58293 | 173 | 43 | 118 | 46 | 27 | 15 | 422 |
Remove the singular items
# X = num_df[['Num_children', 'Child', 'Income', 'Total']]
# X.head()
sns.heatmap(X.corr())
plt.show()
Scale the data between 0 and 1¶
scaler = MinMaxScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
X_scaled.head()
| Num_children | Income | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.503625 | 0.425318 | 0.442211 | 0.316522 | 0.664093 | 0.335878 | 0.274143 | 0.639683 |
| 1 | 0.666667 | 0.398325 | 0.007368 | 0.005025 | 0.003478 | 0.007722 | 0.003817 | 0.018692 | 0.008730 |
| 2 | 0.000000 | 0.623933 | 0.285332 | 0.246231 | 0.073623 | 0.428571 | 0.080153 | 0.130841 | 0.305952 |
| 3 | 0.333333 | 0.222456 | 0.007368 | 0.020101 | 0.011594 | 0.038610 | 0.011450 | 0.015576 | 0.019048 |
| 4 | 0.333333 | 0.505009 | 0.115874 | 0.216080 | 0.068406 | 0.177606 | 0.103053 | 0.046729 | 0.165476 |
K Means Clustering¶
We will use inertia and silhouette scores to determine what the best number of clusters is
clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
scores = {}
inertias = {}
for i in clusters:
kmeans = KMeans(n_clusters=i, n_init=50) # , algorithm='full')
kmeans.fit(X_scaled)
scores[i] = silhouette_score(X_scaled, kmeans.labels_)
inertias[i] = kmeans.inertia_
# Two subplots
fig, (ax1, ax2) = plt.subplots(nrows=2)
# Inertia
ax1.plot(list(inertias.keys()), list(inertias.values()))
ax1.set_title('Inertia')
# Silhouette Score
ax2.plot(list(scores.keys()), list(scores.values()))
ax2.set_title('Silhouette Score')
# Spacing
fig.tight_layout()
# X-Axis Label
plt.xlabel('Number of Clusters')
plt.show()
Credit¶
The code in the following cell comes directly from the Scikit-learn documentation.
cluster_range = [2, 3, 4, 5, 6, 7, 8, 9, 10]
for n_clusters in cluster_range:
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(18, 7)
ax1.set_xlim([-0.8, 0.8])
ax1.set_ylim([0, len(X_scaled) + (n_clusters + 1) * 10])
# Cluster
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(X_scaled)
silhouette_avg = silhouette_score(X, cluster_labels)
print(
"For n_clusters =",
n_clusters,
"The average silhouette_score is :",
silhouette_avg,
)
sample_silhouette_values = silhouette_samples(X, cluster_labels)
y_lower = 10
for i in range(n_clusters):
# Aggregate the silhouette scores for samples belonging to
# cluster i, and sort them
ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
ax1.fill_betweenx(
np.arange(y_lower, y_upper),
0,
ith_cluster_silhouette_values,
facecolor=color,
edgecolor=color,
alpha=0.7,
)
# Label the silhouette plots with their cluster numbers at the middle
ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
# Compute the new y_lower for next plot
y_lower = y_upper + 10 # 10 for the 0 samples
ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")
# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
ax1.set_yticks([]) # Clear the yaxis labels / ticks
ax1.set_xticks([-0.8, -0.6, -0.4, -0.3, -0.2, -0.1, 0, 0.2, 0.4, 0.6, 0.8])
# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
ax2.scatter(
X_scaled['Total'], X_scaled['Num_children'], marker=".", s=30, lw=0, alpha=0.7, edgecolor="k"
)
# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(
centers[:, 0],
centers[:, 1],
marker="o",
c="white",
alpha=1,
s=200,
edgecolor="k",
)
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=50, edgecolor="k")
ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")
plt.suptitle(
"Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
% n_clusters,
fontsize=14,
fontweight="bold",
)
plt.show()
For n_clusters = 2 The average silhouette_score is : 0.47236915421742814
For n_clusters = 3 The average silhouette_score is : 0.15816357399607045
For n_clusters = 4 The average silhouette_score is : 0.11951188620164022
For n_clusters = 5 The average silhouette_score is : -0.05580554364327878
For n_clusters = 6 The average silhouette_score is : -0.058883794277863166
For n_clusters = 7 The average silhouette_score is : -0.0800796774776505
For n_clusters = 8 The average silhouette_score is : -0.1124454308903397
For n_clusters = 9 The average silhouette_score is : -0.12446301085041107
For n_clusters = 10 The average silhouette_score is : -0.13218092734060352
3 clusters appears to be the best, according to the inertia and silhouette score
kmeans = KMeans(n_clusters=3, n_init=100, algorithm='lloyd', random_state=10)
kmeans.fit(X_scaled)
KMeans(n_clusters=3, n_init=100, random_state=10)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KMeans(n_clusters=3, n_init=100, random_state=10)
Add the labels to the datasets
X_scaled['labels'] = kmeans.labels_
X['labels'] = kmeans.labels_
X.head()
| Num_children | Income | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | labels | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58138 | 635 | 88 | 546 | 172 | 88 | 88 | 1617 | 2 |
| 1 | 2 | 46344 | 11 | 1 | 6 | 2 | 1 | 6 | 27 | 1 |
| 2 | 0 | 71613 | 426 | 49 | 127 | 111 | 21 | 42 | 776 | 0 |
| 3 | 1 | 26646 | 11 | 4 | 20 | 10 | 3 | 5 | 53 | 1 |
| 4 | 1 | 58293 | 173 | 43 | 118 | 46 | 27 | 15 | 422 | 0 |
X_scaled.head()
| Num_children | Income | Wines | Fruits | Meat | Fish | Sweets | Gold | Total | labels | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.503625 | 0.425318 | 0.442211 | 0.316522 | 0.664093 | 0.335878 | 0.274143 | 0.639683 | 2 |
| 1 | 0.666667 | 0.398325 | 0.007368 | 0.005025 | 0.003478 | 0.007722 | 0.003817 | 0.018692 | 0.008730 | 1 |
| 2 | 0.000000 | 0.623933 | 0.285332 | 0.246231 | 0.073623 | 0.428571 | 0.080153 | 0.130841 | 0.305952 | 0 |
| 3 | 0.333333 | 0.222456 | 0.007368 | 0.020101 | 0.011594 | 0.038610 | 0.011450 | 0.015576 | 0.019048 | 1 |
| 4 | 0.333333 | 0.505009 | 0.115874 | 0.216080 | 0.068406 | 0.177606 | 0.103053 | 0.046729 | 0.165476 | 0 |
X_scaled.corr()['labels'].sort_values(ascending=False)
labels 1.000000 Fish 0.460942 Meat 0.417765 Fruits 0.402861 Sweets 0.363023 Total 0.321392 Income 0.185322 Wines 0.127312 Gold 0.064167 Num_children -0.296996 Name: labels, dtype: float64
sns.pairplot(X_scaled, hue='labels')
plt.show()
Let's take a closer look at the Total column¶
sns.pairplot(X_scaled, hue='labels', x_vars=['Total'])
plt.show()
Analyze the Results¶
We've cleaned the data and trained a K Means model on that data. Now it's time to explore the results.
Size of the Clusters¶
Generally speaking, we're looking for clusters of similar sizes
df.head()
| Education | Income | Kidhome | Teenhome | Recency | Wines | Fruits | Meat | Fish | Sweets | ... | Num_children | Has_child | Total | days_joined | Seniority | Partner_Status | Edu | Child | Single | Relationship | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graduation | 58138 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | ... | 0 | False | 1617 | 4571 | 12.52 | 0 | 0 | 0 | 1 | 0 |
| 1 | Graduation | 46344 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | ... | 2 | True | 27 | 4021 | 11.02 | 0 | 0 | 1 | 1 | 0 |
| 2 | Graduation | 71613 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | ... | 0 | False | 776 | 4220 | 11.56 | 1 | 0 | 0 | 0 | 1 |
| 3 | Graduation | 26646 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | ... | 1 | True | 53 | 4047 | 11.09 | 1 | 0 | 1 | 0 | 1 |
| 4 | PhD | 58293 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | ... | 1 | True | 422 | 4069 | 11.15 | 1 | 0 | 1 | 0 | 1 |
5 rows × 36 columns
#df['cluster_labels'] = kmeans.labels_
df['labels'] = kmeans.labels_
# Bar graph of total instances in each label
#fig = df['cluster_labels'].value_counts().plot.bar()
fig = df['labels'].value_counts().plot.bar()
# Set axes to horizontal
plt.tick_params(axis='x', labelrotation=0)
# Chart titles
plt.title('Cluster Size')
plt.xlabel('Label')
plt.ylabel('# of Instances')
plt.show()
top_df = df[['Total', 'Income', 'Meat', 'Wines', 'Fish', 'labels']].copy()
Divide DataFrame¶
Create three smaller DataFrames; one for each label.
# Group 0
label0 = top_df[top_df['labels'] == 0].copy()
label0.drop(columns=['labels'], inplace=True)
label0.reset_index(inplace=True, drop=True)
# Group 1
label1 = top_df[top_df['labels'] == 1].copy()
label1.drop(columns=['labels'], inplace=True)
label1.reset_index(inplace=True, drop=True)
# Group 2
label2 = top_df[top_df['labels'] == 2].copy()
label2.drop(columns=['labels'], inplace=True)
label2.reset_index(inplace=True, drop=True)
label0.describe()
| Total | Income | Meat | Wines | Fish | |
|---|---|---|---|---|---|
| count | 537.000000 | 537.000000 | 537.000000 | 537.000000 | 537.000000 |
| mean | 851.821229 | 63039.093110 | 186.294227 | 506.962756 | 34.595903 |
| std | 291.831219 | 10967.238927 | 143.604138 | 280.332942 | 34.783534 |
| min | 277.000000 | 2447.000000 | 3.000000 | 1.000000 | 0.000000 |
| 25% | 615.000000 | 56559.000000 | 94.000000 | 293.000000 | 10.000000 |
| 50% | 819.000000 | 62845.000000 | 149.000000 | 478.000000 | 23.000000 |
| 75% | 1038.000000 | 69508.000000 | 238.000000 | 656.000000 | 52.000000 |
| max | 1829.000000 | 113734.000000 | 1725.000000 | 1462.000000 | 188.000000 |
label1.describe()
| Total | Income | Meat | Wines | Fish | |
|---|---|---|---|---|---|
| count | 1174.000000 | 1174.000000 | 1174.000000 | 1174.000000 | 1174.000000 |
| mean | 134.417376 | 36232.798126 | 29.601363 | 67.324532 | 8.189097 |
| std | 129.342700 | 12727.878116 | 32.385643 | 88.793809 | 12.887209 |
| min | 5.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 42.250000 | 27100.000000 | 8.000000 | 9.000000 | 2.000000 |
| 50% | 75.000000 | 36425.500000 | 17.000000 | 27.000000 | 4.000000 |
| 75% | 197.250000 | 45578.250000 | 38.750000 | 88.750000 | 11.000000 |
| max | 595.000000 | 70844.000000 | 217.000000 | 516.000000 | 150.000000 |
label2.describe()
| Total | Income | Meat | Wines | Fish | |
|---|---|---|---|---|---|
| count | 490.000000 | 490.000000 | 490.000000 | 490.000000 | 490.000000 |
| mean | 1468.520408 | 75909.848980 | 467.338776 | 658.081633 | 111.514286 |
| std | 388.441036 | 10413.687668 | 228.606679 | 320.821315 | 64.355899 |
| min | 449.000000 | 31907.000000 | 64.000000 | 33.000000 | 0.000000 |
| 25% | 1170.000000 | 70046.250000 | 282.000000 | 398.000000 | 58.250000 |
| 50% | 1457.500000 | 76504.500000 | 430.000000 | 605.500000 | 103.500000 |
| 75% | 1722.000000 | 82343.500000 | 628.000000 | 900.500000 | 160.000000 |
| max | 2525.000000 | 105471.000000 | 984.000000 | 1493.000000 | 259.000000 |
sns.pairplot(top_df, hue='labels', corner=True)
plt.show()
Initial Findings¶
- Unbalanced
- label 0 is as much as labels 1 and 2 combined
- Distinct clusters
- it seems like the clusters are separated as low, medium, and high
#sns.histplot(data = top_df, hue='labels')
# Histogram
for col in top_df.columns:
sns.histplot(top_df[col])
plt.show()
top_df.head()
| Total | Income | Meat | Wines | Fish | labels | |
|---|---|---|---|---|---|---|
| 0 | 1617 | 58138 | 546 | 635 | 172 | 2 |
| 1 | 27 | 46344 | 6 | 11 | 2 | 1 |
| 2 | 776 | 71613 | 127 | 426 | 111 | 0 |
| 3 | 53 | 26646 | 20 | 11 | 10 | 1 |
| 4 | 422 | 58293 | 118 | 173 | 46 | 0 |
TO DO¶
- Scale the data with
MinMaxScaler - Check out each histogram in greater detail
- Use Plotly to make the graph interactive
columns_to_scale = ['Income', 'Meat', 'Wines', 'Fish', 'Total']
df_to_scale = top_df[columns_to_scale].copy()
scaled_data = scaler.fit_transform(df_to_scale)
df_scaled = pd.DataFrame(scaled_data, columns=columns_to_scale)
df_scaled = pd.DataFrame(scaled_data, columns=columns_to_scale)
top_df.head()
| Total | Income | Meat | Wines | Fish | labels | |
|---|---|---|---|---|---|---|
| 0 | 1617 | 58138 | 546 | 635 | 172 | 2 |
| 1 | 27 | 46344 | 6 | 11 | 2 | 1 |
| 2 | 776 | 71613 | 127 | 426 | 111 | 0 |
| 3 | 53 | 26646 | 20 | 11 | 10 | 1 |
| 4 | 422 | 58293 | 118 | 173 | 46 | 0 |
fig = sns.histplot(top_df, x='Wines', hue='labels', multiple="stack", palette="Paired")
# Improve the grid and layout
plt.grid(True, linestyle='--', alpha=0.6)
plt.title('Comparison of Wines by labels', fontsize=16)
# Show the plot
plt.tight_layout() # Automatically adjusts subplots for better spacing
plt.show()
total = top_df['Total'].copy()
meat = top_df['Meat'].copy()
wine = top_df['Wines'].copy()
# Plot histograms with KDE and better aesthetics
plt.figure(figsize=(10, 6)) # Increase figure size for better visualization
sns.histplot(total, color='royalblue', label='Total', stat='density', alpha=0.6, bins=30)
sns.histplot(meat, color='darkorange', label='Meat', stat='density', alpha=0.6, bins=30)
sns.histplot(wine, color='forestgreen', label='Wines', stat='density', alpha=0.6, bins=30)
# Add a legend to differentiate the histograms
plt.legend(title='Categories')
# Title and labels for the axes
plt.title('Comparison of Distributions: Total, Meat, and Wines', fontsize=16)
plt.xlabel('Value', fontsize=12)
plt.ylabel('Density', fontsize=12)
# Improve the grid and layout
plt.grid(True, linestyle='--', alpha=0.6)
# Show the plot
plt.tight_layout() # Automatically adjusts subplots for better spacing
plt.show()